Today, the number of people killed in road crashes around the world continues to increase. According to the World Health Organisation’s “Global Status Report on Road Safety”, it reached 1.35 million in 2016 alone. This means that, worldwide, more people die as a result of road traffic injuries than from HIV/AIDS, tuberculosis or diarrhoeal diseases. And road crashes are now the most common cause of death for children and young people between 5 and 29 worldwide.
Even the significant improvement in road safety (especially regarding severe accidents, in terms of human fatalities, traffic delays and property damage) achieved in the "developed" countries over the last decades has not eliminated the phenomenon, and the mindset required for "Vision Zero" remains at stake: in Europe alone, car accidents kill every week as many people as fit into a jumbo jet, and just as we do not accept deaths in the air, we should no longer accept them on the road.
The City of Seattle has long aligned its road safety policies with the framework and targets that the UN and the WHO have adopted, approaching the issue in terms of a "Safe System". The core elements of this approach are ensuring safe vehicles, safe infrastructure, safe road use (moderate speed, sober driving, wearing seat belts and helmets) and better post-crash care.
But the end of the UN "Decade of Action for Road Safety" (2010-2020) happens to coincide with the moment that the tools of modern data science can be added to the systematic approach described above. Wouldn't it be great if we could exploit well-established algorithms and available data for an extra, case-tailored preventive approach that could warn both drivers and traffic services about the possibility of a car accident, given some objective and subjective conditions and factors? And wouldn't it be useful to predict how severe such an accident would be, so that a driver could drive more carefully or change route, and traffic services could prepare their response? This potentially promising contribution of machine learning to the global debate on road safety is the object of this project, in which the long-established elements of the Safe System approach will be used as predictors for a machine learning model able to predict accident "severity".
The Seattle Department of Transportation (SDOT: the municipal government agency in Seattle, Washington, responsible for maintaining the city's transportation systems, including roads, bridges and public transportation) has asked us, on behalf of its Response Team as well as the Seattle Police Department (SPD), to build an ML model that will help them better perceive and predict the risk of a severe road accident, given conditions and circumstances of the municipal road network that can be evaluated quantitatively or qualitatively.
It should be noted that this is not a study that attempts to link, with an ML approach, the full grid of causes and effects of a car accident. Rather, it is a study that attempts to connect, specifically:
The dataset we will rely on is updated weekly by the SDOT Traffic Management Division; its data come from the Seattle Police Department Traffic Records and cover all types of collisions from 2004 to May 2020.
The dataset is rich, containing many observations (rows) and various attributes (columns). Before wrangling, we have 194,673 observations, most of which are suitable for training and testing the machine learning model. Of course, this does not mean we can skip data cleaning and tidying, or data balancing (otherwise we would build a biased ML model), since, as expected, severe accidents are significantly fewer than non-severe ones (more specifically, severe ones amount to slightly less than 1/3 of the total).
| Attribute | Description |
|---|---|
| SEVERITYCODE | A code that corresponds to the severity of the collision |
| X | Accident location's longitude |
| Y | Accident location's latitude |
| OBJECTID, INCKEY, COLDETKEY, REPORTNO, STATUS, ADRESSTYPE, INTKEY | no description |
| LOCATION | Description of the general location of the collision |
| EXCEPTRSNCODE | no description |
| SEVERITYCODE | Repeat of 1st column (label/target) |
| EXCEPTRSNDESC | no description |
| SEVERITYDESC | A detailed description of the severity of the collision |
| COLLISIONTYPE | Collision type |
| PERSONCOUNT | The total number of people involved in the collision |
| PEDCOUNT | The number of pedestrians involved in the collision. This is entered by the state |
| PEDCYLCOUNT | The number of bicycles involved in the collision. This is entered by the state |
| VEHCOUNT | The number of vehicles involved in the collision. This is entered by the state |
| INCDATE | The date of the incident |
| INCDTTM | The date and time of the incident |
| JUNCTIONTYPE | Category of junction at which collision took place |
| SDOT_COLCODE | A code given to the collision by SDOT |
| SDOT_COLDESC | A description of the collision corresponding to the collision code |
| INATTENTIONIND | Whether or not collision was due to inattention (Y/N) |
| UNDERINFL | Whether or not a driver involved was under the influence of drugs or alcohol |
| WEATHER | A description of the weather conditions during the time of the collision |
| ROADCOND | The condition of the road during the collision |
| LIGHTCOND | The light conditions during the collision |
| PEDROWNOTGRNT | Whether or not the pedestrian right of way was not granted (Y/N) |
| SDOTCOLNUM | A number given to the collision by SDOT |
| SPEEDING | Whether or not speeding was a factor in the collision (Y/N) |
| ST_COLCODE | A code provided by the state that describes the collision |
| ST_COLDESC | A description that corresponds to the state’s coding designation |
| SEGLANEKEY | A key for the lane segment in which the collision occurred |
| CROSSWALKKEY | A key for the crosswalk at which the collision occurred |
| HITPARKEDCAR | Whether or not the collision involved hitting a parked car (Y/N) |
Metadata source
Our dataset's first column is the labeled data, which describes the severity of an accident/collision and takes only two values, corresponding to "severe" and "not severe" (a binary classification problem). Of the remaining 36 columns of the dataset (containing either numerical or categorical data), not all are useful for building our classifier:
Let's first see if any of the selected variables have a significantly high number of missing values (converted into NaN values during csv's import to DataFrame format):
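The check can be sketched as below; this is a minimal example with a toy DataFrame of hypothetical values standing in for the real collisions data (in the project, `df` would come from reading the SDOT CSV):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the collisions DataFrame (hypothetical values);
# in the project, df is loaded from the SDOT collisions CSV.
df = pd.DataFrame({
    "SPEEDING": ["Y", np.nan, np.nan, np.nan],
    "WEATHER": ["Raining", "Clear", np.nan, "Clear"],
})

# Share of missing (NaN) values per column, as a percentage, sorted descending
missing_pct = (df.isna().mean() * 100).round(2).sort_values(ascending=False)
print(missing_pct)
```

Columns whose missing percentage is very high are candidates for dropping altogether.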
From the above we can easily conclude that there is no interest in using three more variables:
as the percentage of observations missing such values is 84.69%, 97.60% and 95.21% respectively.
Having dealt with the variables that have a significantly high number of missing values, we should also handle the missing values (NaN and 'Unknown') of the remaining ones.
Last but not least, let's check the unique values of each variable we retained, in order to confirm that we have ended up with a clean dataset with respect to both the numerical and categorical variables we will use for our model building:
It looks like we have to deal with two more issues:
Finally, before proceeding to our Exploratory Data Analysis and Model Building, let's convert INCDTTM into MONTH, WEEKDAY, and HOUR:
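The conversion can be sketched with pandas datetime accessors; the two timestamps below are toy values, and the `"%m/%d/%Y %I:%M:%S %p"` format string is an assumption about how INCDTTM is written in the CSV:

```python
import pandas as pd

# Toy INCDTTM values (format string is an assumption about the raw data)
df = pd.DataFrame({"INCDTTM": ["3/27/2013 2:54:00 PM", "12/9/2006 10:20:00 AM"]})

dt = pd.to_datetime(df["INCDTTM"], format="%m/%d/%Y %I:%M:%S %p")
df["MONTH"] = dt.dt.month
df["WEEKDAY"] = dt.dt.dayofweek  # Monday=0 ... Sunday=6
df["HOUR"] = dt.dt.hour
```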
So we can now say that we have completed our data preparation tasks (cleansing and transforming raw data). From here on, any further transformation of our data will be imposed by the needs of the exploratory analysis and by the feature selection/extraction and data preprocessing for the model building itself.
In this project we will direct our efforts towards building a good classifier with which to predict the severity (involving human injury) of a car accident, given some conditions and space and time coordinates.
In the first step, we collected, cleaned and prepared our data, getting rid of variables whose interest lies outside the scope of this project.
The second step is our exploratory analysis, mainly based on bivariate and correlation analysis investigating the relationships between our variables. Here we will try to understand which of them would make the best features for our model, which should be transformed for that purpose, and which should be considered redundant. Next we will work on the spatial aspect of our problem: from the simple X-Y accident coordinates we will try to extract some other spatial feature, useful for our model and practical for the stakeholders (SDOT, SPD). After that, we will take care of preprocessing our dataset so that it can be digested by the algorithms: we will balance our severely unbalanced dataset and proceed to a train-test split.
Finally, we will build and optimize our classification model, based on four algorithms:
From the plot alone, it seems that the WEEKDAY variable, at least as it stands, will not play a major role in our models. So we transform it right away from a 7-category variable into a binary one, that is, weekdays (Mon-Thu): 0 and long weekend (Fri-Sun): 1, and we replot:
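The transformation described above can be sketched as follows (toy WEEKDAY values, with pandas' Monday=0 convention assumed):

```python
import pandas as pd

df = pd.DataFrame({"WEEKDAY": [0, 3, 4, 6]})  # Monday=0 ... Sunday=6

# 0 = plain weekday (Mon-Thu), 1 = "long weekend" (Fri-Sun)
df["WEEK_DAY/END"] = (df["WEEKDAY"] >= 4).astype(int)
```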
As previously, the plot suggests that the MONTH variable, at least as it stands, will not play a major role in our models. So we transform it right away from a 12-class variable into a 4-class one, that is, Winter (1, 2, 12), Spring (3-5), Summer (6-8) and Autumn (9-11), and we replot:
As previously, the plot suggests that the HOUR variable, at least as it stands, will not play a major role in our models. So we transform it right away from a 24-value variable into a coarser DAY_PERIOD variable that groups the hours into periods of the day, and we replot:
With our extended df, we will investigate the correlations of our variables with each other and with the target variable. But since the majority of our data, apart from the X and Y coordinates, are categorical, we will use Cramér's V correlation and its heatmap.
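Cramér's V is not built into pandas, so a small helper is needed; a minimal sketch of the plain (bias-uncorrected) version, computed from the chi-squared statistic of the contingency table, could look like this:

```python
import numpy as np
import pandas as pd

def cramers_v(x, y):
    """Cramér's V association between two categorical variables (0 to 1).

    Plain (bias-uncorrected) version, computed from the chi-squared
    statistic of the contingency table of x and y.
    """
    ct = pd.crosstab(pd.Series(x), pd.Series(y)).to_numpy()
    n = ct.sum()
    expected = np.outer(ct.sum(axis=1), ct.sum(axis=0)) / n
    chi2 = ((ct - expected) ** 2 / expected).sum()
    r, k = ct.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))
```

The heatmap is then obtained by applying `cramers_v` to every pair of categorical columns.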
From the above, we can safely conclude that we have to choose:
We decide to go, respectively, for DAY_PERIOD, MONTH and WEEKDAY, instead of HOUR, SEASON and WEEK_DAY/END. In all cases, we based our choice on which variable has the lower Cramér's V correlation with the rest of the variables (we would also have used the criterion of which variable has the higher Cramér's V correlation with the target variable, SEVERITYCODE, but in our case all six pairs' values are 0).
So it is time to deal with the spatial parameter of our problem. There are several ways to integrate the data of columns X and Y into the final table with which we will train our model:
1. The first option, changing nothing and feeding our model the geographical coordinates of each accident, doesn't seem very efficient or helpful for the practical needs of the SDOT and SPD forces that will oversee the roads of the area and alert for response wherever necessary. It is much more practical for a service to have areas with related characteristics (a higher or lower probability of a serious accident, requiring a specific level of protective and precautionary measures and vigilance) distinct from their neighboring ones. Grouping in this way, we may lose in precision (since we would act/think in terms of the "average"), but we gain in abstraction, as we have fewer details to worry about. In fact, no one guarantees that we would even lose in accuracy; moreover, training the algorithm with such finely differentiated observations would likely lead to over-fitting.
2. Regarding the clustering approach, it would be interesting in our case to try two different clustering techniques:
But as we can see in our Appendix, neither of these methods leads to results that would contribute to the practicality aspect of our project, as
3. 2-D Location Binning:
The simplest approach, and in our case more efficient and usable than the previous ones, is to create a grid map based on meridians and parallels (this is why the grid, in Mercator projection, appears slanted). We can even calculate the average SEVERITYCODE, practically the average probability of a severe car accident, for each one of the grid's blocks.
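The binning can be sketched with `pd.cut`; the coordinates and severity codes below are toy values, and the 2x2 grid resolution is chosen only for illustration (the project would use a much finer grid):

```python
import pandas as pd

# Toy accidents with longitude (X), latitude (Y) and severity (values assumed)
df = pd.DataFrame({
    "X": [-122.33, -122.30, -122.35, -122.31],
    "Y": [47.60, 47.61, 47.65, 47.66],
    "SEVERITYCODE": [1, 2, 1, 1],
})

NB = 2  # grid resolution per axis (toy value)
x_bin = pd.cut(df["X"], bins=NB, labels=False)
y_bin = pd.cut(df["Y"], bins=NB, labels=False)
df["XY_bin"] = y_bin * NB + x_bin  # single id per grid block

# Average probability of a severe accident per block (severe=2, not severe=1)
block_severity = df.groupby("XY_bin")["SEVERITYCODE"].mean() - 1
```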
4. Distance from city center
Another approach is to group the accidents by distance from Seattle's city center and calculate, for this case as well, the average SEVERITYCODE (AVG(SEVERITYCODE) - 1 ~ severity probability):
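A minimal sketch of this feature: distance from a fixed center via the haversine formula, then binning into rings. The center coordinates are an approximation of downtown Seattle, and the 1 km ring width is an assumption for illustration:

```python
import numpy as np
import pandas as pd

CENTER_LAT, CENTER_LON = 47.6062, -122.3321  # approximate Seattle city center

def haversine_km(lat, lon, lat0, lon0):
    """Great-circle distance (km) between points (lat, lon) and a fixed center."""
    lat, lon, lat0, lon0 = map(np.radians, (lat, lon, lat0, lon0))
    a = (np.sin((lat - lat0) / 2) ** 2
         + np.cos(lat) * np.cos(lat0) * np.sin((lon - lon0) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

df = pd.DataFrame({"Y": [47.6062, 47.70], "X": [-122.3321, -122.3321]})
df["DISTANCE"] = haversine_km(df["Y"], df["X"], CENTER_LAT, CENTER_LON)
df["DISTANCE_bin"] = (df["DISTANCE"] // 1).astype(int)  # 1 km rings (width assumed)
```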
So, let's take a first look at the correlation between the two new spatial features. To do so, we rely once more on Cramér's V correlation:
From the above figure, an indisputable correlation between these two features becomes apparent. So, when we arrive at the model-building stage, we will not use the two features at the same time at first: we will train the models including only one of XY_bin and DISTANCE_bin. Since our primary goal is both to make predictions and to better understand the role of each independent variable, we will then try both at the same time, keeping in mind that collinearity affects the coefficients and p-values, but does not influence the predictions, the precision of the predictions, or the goodness-of-fit statistics.
The challenge of working with imbalanced datasets is that most machine learning techniques will ignore, and in turn have poor performance on, the minority class, although typically it is performance on the minority class that is most important.
One approach to addressing imbalanced datasets is to oversample the minority class, for which both simple and more elaborate techniques exist (e.g. SMOTE). In our case we will proceed by simply downsampling the majority class (SEVERITYCODE = 1):
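The downsampling step can be sketched like this, with a toy imbalanced frame standing in for the real data:

```python
import pandas as pd

# Toy imbalanced frame: 6 non-severe (code 1) vs 2 severe (code 2) accidents
df = pd.DataFrame({"SEVERITYCODE": [1, 1, 1, 1, 1, 1, 2, 2],
                   "FEAT": range(8)})

minority = df[df["SEVERITYCODE"] == 2]
majority = df[df["SEVERITYCODE"] == 1].sample(n=len(minority), random_state=42)

# Balanced frame, shuffled so the two classes are interleaved
df_balanced = pd.concat([majority, minority]).sample(frac=1, random_state=42)
```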
Before starting the model building, let's randomly select a 5% fraction of our dataset and hold it out, to use eventually as the test dataset for our different ML algorithms.
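The hold-out step itself is a one-liner with `DataFrame.sample`; a toy frame of 100 rows stands in for the real data here:

```python
import pandas as pd

df = pd.DataFrame({"SEVERITYCODE": [1, 2] * 50, "FEAT": range(100)})

holdout = df.sample(frac=0.05, random_state=42)  # 5% final test set
train_val = df.drop(holdout.index)               # the rest, for training/validation
```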
We continue by encoding the variables that need it, choosing one-hot encoding for some of them:
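With pandas this is a single call to `get_dummies`; the two categorical columns below are toy examples:

```python
import pandas as pd

df = pd.DataFrame({"WEATHER": ["Clear", "Raining", "Clear"],
                   "ROADCOND": ["Dry", "Wet", "Dry"]})

# One indicator column per category value; the original columns are dropped
encoded = pd.get_dummies(df, columns=["WEATHER", "ROADCOND"])
```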
We will simply drop all the feature values that account for fewer than 1 in 1,000 occurrences in our dataset.
So, we end up with 26 features
Finally, we normalize our features, taking care at the same time to create 4 different feature sets, that is, all possible combinations with or without XY_bin and DISTANCE_bin.
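A minimal sketch of this step, with a toy frame and hypothetical feature-set names (`X_both`, `X_xy`, `X_dist`, `X_none`):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with one ordinary feature plus the two spatial ones
df = pd.DataFrame({
    "PERSONCOUNT": [2, 3, 5, 1],
    "XY_bin": [0, 3, 1, 2],
    "DISTANCE_bin": [1, 4, 2, 9],
})

def scaled(frame):
    """Standardize features to zero mean and unit variance."""
    return StandardScaler().fit_transform(frame)

X_both = scaled(df)                                           # both spatial features
X_xy   = scaled(df.drop(columns=["DISTANCE_bin"]))            # only XY_bin
X_dist = scaled(df.drop(columns=["XY_bin"]))                  # only DISTANCE_bin
X_none = scaled(df.drop(columns=["XY_bin", "DISTANCE_bin"]))  # no spatial feature
```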
Now it is time to build our models and then use the test set to report their accuracy. We will use the following algorithms:
in all the four different options:
Instead of a train-test split with a simple out-of-sample validation of the result, we will use 10-fold cross-validation:
Instead of using the grid (or random) search embedded in the scikit-learn library to optimize the hyperparameter k, we will iterate to calculate different models with different values of k nearest neighbors for the different feature sets (with and without the two spatial features); then we will calculate and plot the corresponding average accuracies based on the cross-validation.
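The iteration can be sketched as follows; synthetic data from `make_classification` stands in for one of our feature sets, and the k range is shortened for illustration (the project sweeps 1..100):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic balanced binary data standing in for one of the feature sets
X, y = make_classification(n_samples=200, n_features=6, random_state=42)

ks = list(range(1, 11))  # the project sweeps 1..100; shortened here
mean_acc = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()
            for k in ks]
best_k = ks[int(np.argmax(mean_acc))]  # in the project, the "elbow" is read off a plot
```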
Since these tests are numerous and computationally expensive, we will use only a sample fraction (10%) of the whole dataset.
So let's compute the accuracy for k values from 1 to 100 with the Xboth feature set:
According to our results, for the Xboth feature set it's better to use a k value equal to 32 (elbow). Thus we choose k = 32 and we train our model again, this time with all our data.
We go on with the Xxy feature set:
According to our results, for the Xxy feature set it's better to use a k value equal to 24 (elbow). Thus we choose k = 24 and we train our model again, this time with all our data.
We go on with the Xdist feature set:
According to our results, for Xdist feature set it's better to use a k value equal to 23 (elbow). Thus we choose the k to be 23 and we train again our model, this time with all our data.
And we conclude the knn classifier optimization part with the Xnone feature set:
According to our results, for Xnone feature set it's better to use a k value equal to 30. Thus we choose the k to be 30 and we train again our model with all our data.
Here again, we will stick to the previous approach: we will iterate to calculate different models with different values of max_depth for our decision trees
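As with KNN, the sweep can be sketched with synthetic data standing in for our feature sets:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for one of the feature sets
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Mean 10-fold CV accuracy for each max_depth value
depth_acc = {d: cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0),
                                X, y, cv=10).mean()
             for d in range(1, 11)}
```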
From the above results, it becomes clear that we shouldn't use a Decision Tree classifier. We trained all datasets with increasing depth and charted the performances. As apparent from the figure, as the depth of the tree increases, the model accuracy (our evaluation metric) clearly decreases. This observation concerns, of course, only the validation and cross-validation datasets; if we measured accuracy on the training set itself, we would certainly and logically observe a severe gap, with the opposite tendency: training accuracy increases as complexity (in our case, max depth) increases. In other words, as the complexity of the decision tree grows with tree depth, overfitting grows as well. The fact that the accuracy on the validation set (the CV accuracy) drops significantly as tree depth increases suggests overfitting, which is in any case an instance of one of the critical shortcomings of decision trees that any data scientist should be aware of: decision trees overfit very easily.
We will also try the prediction efficiency of the Support Vector Machine algorithm with different kernel functions — radial basis, linear, polynomial and sigmoid:
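The kernel comparison can be sketched like this, again with synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic binary data standing in for one of the feature sets
X, y = make_classification(n_samples=200, n_features=6, random_state=1)

# Mean cross-validated accuracy per kernel function
kernel_acc = {k: cross_val_score(SVC(kernel=k), X, y, cv=5).mean()
              for k in ["rbf", "linear", "poly", "sigmoid"]}
```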
From the above, it becomes evident that the linear kernel function is the one we should use for our Support Vector Machine classifier. Thus we train our models again accordingly, with all our data.
Finally, we will try a Logistic Regression approach, and we will start by testing the different solvers for C = 0.01
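The solver comparison can be sketched as below, with synthetic stand-in data (`max_iter=1000` is added here so the iterative solvers converge):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary data standing in for one of the feature sets
X, y = make_classification(n_samples=200, n_features=6, random_state=2)

# Mean cross-validated accuracy per solver, at the fixed C = 0.01
solver_acc = {s: cross_val_score(
                  LogisticRegression(C=0.01, solver=s, max_iter=1000),
                  X, y, cv=5).mean()
              for s in ["liblinear", "lbfgs", "newton-cg", "sag", "saga"]}
```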
Even though the differences are very small, or exactly because of this, we can safely conclude that we should pick the liblinear solver and try to optimize its performance for all our datasets.
In the meantime, by progressively narrowing the range of C values, we end up investigating the best value for the parameter C in the interval [0.0005, 0.005].
So, it comes up that the ideal C values for all four feature sets are:
And accordingly we train again our models, this time with all the data.
In order to evaluate the selected models, we will use the hold-out test dataset, which we have to preprocess in the same way (using the same parameters) as the training-validation dataset.
We will simply drop all the feature values that account for fewer than 1 in 1,000 occurrences in our dataset.
From the above, we can say that we would prefer the Logistic Regression algorithm, preferably with the dataset that includes no spatial features at all. In any case, the prediction algorithms' performance cannot be considered good. This low performance becomes even more apparent if we take into account that, for an absolutely balanced binary classification, the worst algorithm, blind guessing, achieves an accuracy of 50%. Our future work should concentrate on improving the variables' exploratory analysis for better feature extraction and selection. Mainly, though, we should insist more on feature engineering exploiting the location data. Clustering (DBSCAN and K-means) has its limitations, but there is a lot of interesting and potentially fruitful work to be done in this direction, so as to manage to create "practical" clusters based on location and car accident severity historical data.
In this study, our goal was to accurately predict the severity of an accident from the given features. The results leave room for better performance; in particular, a lot of improvement can be made on class 1 and class 2 predictions. Such models can be very useful in helping weather stations or news programs alert drivers to the probability of car crashes and their type of severity (damage, injuries, fatality, ...).